Filtering Image Spam with Near-Duplicate Detection

Authors

  • Zhe Wang
  • William K. Josephson
  • Qin Lv
  • Moses Charikar
  • Kai Li
Abstract

A new trend in email spam is the emergence of image spam. Although current anti-spam technologies are quite successful in filtering text-based spam emails, image spam is substantially more difficult to detect because it employs a variety of image creation and randomization algorithms. Spam image creation algorithms are designed to defeat well-known vision algorithms such as optical character recognition (OCR), whereas randomization techniques ensure the uniqueness of each image. We observe that image spam is often sent in batches of visually similar images that differ only because of the randomization algorithms applied. Based on this observation, we propose an image spam detection system that uses near-duplicate detection to detect spam images. We rely on traditional anti-spam methods to detect a subset of spam images and then use multiple image spam filters to detect all the spam images that “look” like the spam caught by traditional methods. We have implemented a prototype system that achieves a high detection rate with a false positive rate of less than 0.001%.
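
The two-stage idea in the abstract (seed a set of known spam images from a traditional filter, then flag incoming images that look like any of them) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the single feature (a coarse RGB color histogram), the L1 distance, the threshold of 0.2, and the names color_histogram, l1_distance, and NearDuplicateFilter are all hypothetical. Images are represented as plain lists of (R, G, B) tuples so the example has no dependencies.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized, coarsely quantized RGB histogram (assumed feature)."""
    if not pixels:
        raise ValueError("image has no pixels")
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two equal-length histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


class NearDuplicateFilter:
    """Flags images whose histogram is close to any known spam histogram."""

    def __init__(self, threshold: float = 0.2) -> None:
        self.threshold = threshold  # hypothetical value, not from the paper
        self.known_spam: List[List[float]] = []

    def add_known_spam(self, pixels: Sequence[Pixel]) -> None:
        # In the abstract's terms: an image already caught by traditional methods.
        self.known_spam.append(color_histogram(pixels))

    def is_spam(self, pixels: Sequence[Pixel]) -> bool:
        hist = color_histogram(pixels)
        return any(l1_distance(hist, k) <= self.threshold for k in self.known_spam)


# Toy data: a seed spam image, a randomized variant with a few noise pixels,
# and an unrelated image.
spam_seed = [(255, 255, 255)] * 90 + [(200, 0, 0)] * 10
randomized = [(255, 255, 255)] * 88 + [(200, 0, 0)] * 10 + [(10, 10, 10)] * 2
unrelated = [(30, 120, 60)] * 100

spam_filter = NearDuplicateFilter()
spam_filter.add_known_spam(spam_seed)
variant_flagged = spam_filter.is_spam(randomized)
unrelated_flagged = spam_filter.is_spam(unrelated)
```

In this toy setup the randomized variant differs from the seed by only a few pixels, so its histogram stays within the threshold, while the unrelated image falls in a different bin entirely. A real system would need features that survive the randomization techniques the abstract mentions, and the abstract indicates that several filters are combined rather than one.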

Similar Resources

Optimized near Duplicate Matching scheme for E-mail Spam Detection

A major problem people face today is e-mail spam. In recent years, many schemes have been developed to detect spam e-mails. The primary idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed from user feedback, to block subsequent near-duplicate spam. We propose a novel e-mail abstraction scheme, w...

Improved near Duplicate Matching Scheme for E-mail Spam Detection

A major problem people face today is e-mail spam. In recent years, many schemes have been developed to detect spam e-mails. The primary idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed from user feedback, to block subsequent near-duplicate spam. We propose a novel e-mail abstraction scheme, ...

Accurate Spam Mail Detection

As the number of e-mail users increases, the e-mail spam problem grows proportionally. Spam filtering with a near-duplicate matching scheme has been widely discussed in recent years. It is based on a known spam database formed from user feedback, which cannot fully capture the evolving nature of spam and also requires much storage. In view of the above drawbacks, we proposed an effective spam detectio...

A Sobel Edge Detection Algorithm Based System for Analyzing and Classifying Image Based Spam

Early spam mails were only text-based; however, spammers have moved to more sophisticated spamming techniques that involve images, now generally termed image-based spam. In most image-based spam, the entire spam message, which may sometimes be text, is embedded in an image of any format. This type of spam email adds another dimension to the spam filtering problem. Extracting text f...

Trends in Combating Image Spam E-mails

With the rapid adoption of the Internet as an easy way to communicate, the amount of unsolicited e-mail, known as spam e-mail, has been growing rapidly. The major problems with spam e-mails are the loss of productivity and the drain on IT resources. Today, we receive spam more rapidly than legitimate e-mails. Initially, spam e-mails contained only textual messages which were easily detected by the ...

Publication date: 2007